This project uses the World Bank’s Education Statistics dataset to explore long-term trends and relationships among key indicators of the US education system. Our goal is to understand how structural aspects of education (access, staffing, and outcomes) have evolved over time and how changes in one variable may influence changes in another. We analyze variables across different categories and examine differences between sexes. By focusing on longitudinal, cross-national data, we aim to identify potential patterns and meaningful interactions that could inform educational policy and investment.
The analysis of various indicators allows us to examine temporal
changes and potential external factors that might shape educational
outcomes.
We selected two major indicator categories that reflect
different dimensions of the education system:
Enrollment Rate vs. Attendance Rate
We will compare enrollment and attendance rates across different levels
of schooling (primary, secondary, tertiary) to test whether higher
enrollment leads to consistent attendance over time.
Number of Teachers vs. Teaching Staff
Compensation
We will investigate how changes in staffing levels correspond with
compensation across educational stages.
When we were brainstorming datasets to go with, we had difficulty choosing a topic because the possibilities were endless. We looked at several options from public health to climate to economics, but we eventually narrowed it down to education, which was something that all of us are connected to and that our classmates and readers could easily relate to. We initially considered focusing on Emory-specific data but broadened our scope to the World Bank’s Education Statistics for its larger context and long-term coverage.
As students at an American university, we were especially interested in examining data trends within the United States. Education policy and access here have evolved significantly over time, and exploring long-term patterns in enrollment, attendance, and staffing can reveal whether increased investment in education translates into better participation and equity. This directly relates to us, as college students who have undergone the three main levels of education and were told that our own efforts, most notably in high school, would determine our outcomes later in life. But how much do individual outcomes actually depend on the student, and how much on the school systems and structures that support them?
We will perform Exploratory Data Analysis (EDA) in R
using the tidyverse suite of libraries.
Our analysis will include the following components:
Before proceeding, we load all necessary packages and set global options for reproducibility and consistent output formatting.
# install.packages("tidyverse")
# install.packages("readr")
# install.packages("knitr")
# install.packages("plotly")
library(tidyverse) # core data manipulation & visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr) # efficient CSV import
library(knitr)
library(plotly) # interactive data visualization
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(janitor) # column name cleaning & simple tabulations
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate) # data parsing and time handling
set.seed(42)
opts_chunk$set(echo=TRUE, message=FALSE, warning=FALSE)
message("All packages loaded.")
## All packages loaded.
We start from reading the raw EdStats dataset
(EdStats_v01.csv) and filter only the United
States(Country code == "USA"). We write the filtered
dataset to EdStats_USA.csv so later chunks only work with
US data.
We also drop fully empty columns (2024 column) and write
each subset to a CSV for reuse in later chunks.
This simplifies the dataset to the country of interest and removes all-NA columns to keep downstream transformations clean.
#set up data for use
data <- read_csv("EdStats_v01.csv")
#clean up data to have USA data
data_clean <- data %>%
filter(grepl("USA", `Country code`)) %>%
select(-`2024`) #all na
#make sure no NA cols remain
colnames(data_clean)[colSums(is.na(data_clean)) == nrow(data_clean)]
## character(0)
write_csv(data_clean, "EdStats_USA.csv")
To keep downstream steps tidy and reproducible, we create three thematic subsets aligned to our research questions:
EdStats_attend.csv.EdStats_teacher.csv.This ensures we work with clean, directly comparable variables.
# 1. Enrollment vs. Attendance
# filter for enrollment/attendance info
data <- read_csv("EdStats_USA.csv")
data_filter_attend <- data %>%
filter(grepl("total net enrolment rate|total net attendance rate", `Indicator name`, ignore.case = TRUE)) %>%
filter(!grepl("female|male", `Indicator name`)) %>%
filter(!grepl("adjusted gender parity index", `Indicator name`, ignore.case = TRUE)) %>%
select(where(~!all(is.na(.)))) %>% view() %>%
write_csv("EdStats_attend.csv")
# 2. Teachers vs. Compensation
#filter for num teachers and teaching staff compensation info
data <- read_csv("EdStats_USA.csv")
data_filter_teacher <- data %>%
filter(grepl("teachers in|teaching staff compensation", `Indicator name`, ignore.case = TRUE)) %>%
filter(!grepl("female|male", `Indicator name`)) %>%
filter(!grepl("percentage of qualified", `Indicator name`, ignore.case = TRUE)) %>%
select(where(~!all(is.na(.)))) %>% view() %>%
write_csv("EdStats_teacher.csv")
The following chunk expands our analysis of
EdStats_attend.csv to quantify and visualize the
relationship between school enrollment and
attendance across education levels in the United
States.
After reshaping the dataset into tidy long form, we standardize
indicator names to capture both spellings of “enrolment/enrollment” and
collapse duplicate series to obtain a single observation for each
year × level × type combination.
From these, we compute a new variable representing the
attendance gap (Enrollment – Attendance) to
measure how much access (enrollment) translates into sustained
participation (attendance).
tendency_attend.csv for
reproducibility.Reading tip:
Focus on whether attendance consistently trails enrollment and whether
that gap changes across levels or decades. High correlation indicates
structural progress—both metrics rising together—while a persistent
positive gap suggests that access alone does not guarantee
participation.
data_attend <- readr::read_csv("EdStats_attend.csv", show_col_types = FALSE)
# 1) Long format with robust labeling (catch both 'enrolment' and 'enrollment' spellings)
data_long <- data_attend %>%
pivot_longer(cols = matches("^\\d{4}$"), names_to = "Year", values_to = "Value") %>%
mutate(
Year = as.numeric(Year),
# Standardize type labels
Type = case_when(
str_detect(`Indicator name`, regex("attendance", ignore_case = TRUE)) ~ "Attendance",
str_detect(`Indicator name`, regex("enrol?ment", ignore_case = TRUE)) ~ "Enrollment", # enrolment or enrollment
TRUE ~ NA_character_
),
# Standardize level labels
Level = case_when(
str_detect(`Indicator name`, regex("primary", ignore_case = TRUE)) ~ "Primary",
str_detect(`Indicator name`, regex("lower\\s*secondary", ignore_case = TRUE)) ~ "Middle",
str_detect(`Indicator name`, regex("upper\\s*secondary", ignore_case = TRUE)) ~ "High",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(Type), !is.na(Level), !is.na(Value))
# 2) Collapse duplicates within (Year, Level, Type)
# (If the catalog has multiple series for the same concept, average them.)
collapsed <- data_long %>%
group_by(Level, Year, Type) %>%
summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop")
# 3) Wide format with both columns side-by-side
gap_df <- collapsed %>%
pivot_wider(names_from = Type, values_from = Value) %>%
mutate(Gap = Enrollment - Attendance)
# Quick diagnostics (see what exists)
print(table(gap_df$Level, useNA = "ifany"))
##
## High Middle Primary
## 14 18 22
print(summary(gap_df[, c("Enrollment", "Attendance")]))
## Enrollment Attendance
## Min. :85.90 Min. :94.64
## 1st Qu.:96.25 1st Qu.:96.55
## Median :97.77 Median :97.22
## Mean :97.07 Mean :97.10
## 3rd Qu.:99.17 3rd Qu.:98.00
## Max. :99.97 Max. :98.32
## NA's :3 NA's :36
# If there are no overlapping rows, bail out gracefully
if (nrow(drop_na(gap_df, Enrollment, Attendance)) < 2) {
warning("No overlapping Enrollment & Attendance observations after cleaning; skipping correlation and scatter.")
} else {
# Correlation and scatter only when data exist
corr_df <- gap_df %>% drop_na(Enrollment, Attendance)
r_val <- cor(corr_df$Enrollment, corr_df$Attendance, use = "pairwise.complete.obs", method = "pearson")
message(sprintf("Correlation between Enrollment and Attendance: r = %.2f", r_val))
p_scatter <- ggplot(corr_df, aes(Enrollment, Attendance, color = Level)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
labs(
title = sprintf("Attendance vs. Enrollment (r = %.2f)", r_val),
x = "Enrollment Rate (%)", y = "Attendance Rate (%)"
) +
theme_minimal(base_size = 12)
print(p_scatter)
}
# 4) Trend lines (unchanged, but use 'collapsed' to avoid duplicates)
p_trend <- ggplot(collapsed, aes(Year, Value, color = factor(..group..))) +
geom_line(linewidth = 1, na.rm = TRUE, aes(color = after_stat(NULL))) # placeholder; we'll fix color below
# Better: rebuild trend with explicit mapping
p_trend <- ggplot(collapsed, aes(Year, Value, color = Type)) +
geom_line(linewidth = 1, na.rm = TRUE) +
geom_point(size = 1.5, alpha = 0.7) +
facet_wrap(~ Level, ncol = 3) +
scale_color_manual(values = c("Attendance" = "#FF69B4", "Enrollment" = "#3BBF8F")) +
labs(title = "Attendance vs Enrollment over Time (USA)", y = "Rate (%)", color = "Series") +
theme_minimal(base_size = 12)
print(p_trend)
# 5) Gap over time (only where both exist)
if (any(!is.na(gap_df$Gap))) {
p_gap <- ggplot(gap_df, aes(Year, Gap, color = Level)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_line(linewidth = 1, na.rm = TRUE) +
geom_point(size = 1.3, alpha = 0.8) +
labs(title = "Enrollment – Attendance Gap (percentage points)", y = "Gap (pp)", x = "Year") +
theme_minimal(base_size = 12)
print(p_gap)
}
# 6) Summary stats (use collapsed to avoid duplicate inflation)
summary_stats <- collapsed %>%
group_by(Type, Level) %>%
summarize(
n = n(),
mean_rate = mean(Value, na.rm = TRUE),
median_rate = median(Value, na.rm = TRUE),
sd_rate = sd(Value, na.rm = TRUE),
min_rate = min(Value, na.rm = TRUE),
max_rate = max(Value, na.rm = TRUE),
IQR = IQR(Value, na.rm = TRUE),
.groups = "drop"
)
readr::write_csv(summary_stats, "tendency_attend.csv")
summary_stats
## # A tibble: 6 × 9
## Type Level n mean_rate median_rate sd_rate min_rate max_rate IQR
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Attendance High 6 96.4 96.6 0.583 95.3 96.8 0.383
## 2 Attendance Middle 6 97.7 98.1 0.966 95.8 98.3 0.493
## 3 Attendance Primary 6 97.2 97.8 1.32 94.6 98.0 0.805
## 4 Enrollment High 13 94.4 94.9 2.46 90.9 98.8 4.17
## 5 Enrollment Middle 17 98.8 99.2 1.11 96.3 100.0 1.05
## 6 Enrollment Primary 21 97.3 98.0 2.85 85.9 99.6 1.24
This section examines how teacher workforce size and
compensation have evolved in the United States from
2014 to 2021.
Using EdStats_teacher.csv, we identify indicators relating
to the number of teachers and the share of
educational expenditure allocated to teaching staff
compensation.
After filtering out non-teaching staff and qualification percentages,
the data are reshaped into tidy long form to create two new analytical
variables:
Teacher counts are scaled per 10 000 to make them visually comparable
to percentages.
We then:
Type × Level
combination.tendency_teachers.csv
for reproducibility.Reading tip:
Focus on whether the compensation share rises or falls
with the number of teachers.
Parallel upward trends may indicate sustained investment in
instructional personnel, while divergence could suggest funding
reallocation, changing class sizes, or reporting gaps across education
levels.
# 1) Read & filter (keep only teaching staff; drop qualification % series)
teacher_raw <- read_csv("EdStats_teacher.csv", show_col_types = FALSE) %>%
filter(!str_detect(`Indicator name`, regex("non-?teaching staff", ignore_case = TRUE))) %>%
filter(!str_detect(`Indicator name`, regex("percentage of qualified", ignore_case = TRUE)))
# 2) Long format (auto-detect year columns) + standardized labels
teachers_long0 <- teacher_raw %>%
pivot_longer(cols = matches("^\\d{4}$"), names_to = "Year", values_to = "Value") %>%
mutate(
Year = as.numeric(Year),
Type = case_when(
str_detect(`Indicator name`, regex("compensation", ignore_case = TRUE)) ~
"Compensation % of Total Expenditure",
str_detect(`Indicator name`, regex("teachers? in|number of teachers", ignore_case = TRUE)) ~
"Number of Teachers",
TRUE ~ NA_character_
),
Level = case_when(
str_detect(`Indicator name`, regex("\\bpre-?primary\\b", ignore_case = TRUE)) ~ "Pre-Primary",
str_detect(`Indicator name`, regex("\\bprimary\\b", ignore_case = TRUE)) ~ "Primary",
str_detect(`Indicator name`, regex("lower\\s*secondary", ignore_case = TRUE)) ~ "Lower Secondary",
str_detect(`Indicator name`, regex("upper\\s*secondary", ignore_case = TRUE)) ~ "Upper Secondary",
str_detect(`Indicator name`, regex("\\bsecondary\\b", ignore_case = TRUE)) ~ "Secondary (unspecified)",
str_detect(`Indicator name`, regex("\\btertiary\\b|post-?secondary|higher education", ignore_case = TRUE)) ~ "Tertiary",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(Type), !is.na(Level), !is.na(Value))
# 3) Collapse duplicates within (Year, Level, Type) to ensure 1 row per cell
teachers_long <- teachers_long0 %>%
group_by(Level, Year, Type) %>%
summarise(Value = mean(Value, na.rm = TRUE), .groups = "drop")
# 4) Scale teacher counts to 10,000s for visual comparability with percentages
teachers_long <- teachers_long %>%
mutate(Value_scaled = if_else(Type == "Number of Teachers", Value / 10000, Value))
# (Optional) order levels for consistent faceting
lvl_order <- c("Pre-Primary", "Primary", "Lower Secondary", "Upper Secondary",
"Secondary (unspecified)", "Tertiary")
teachers_long <- teachers_long %>%
mutate(Level = factor(Level, levels = intersect(lvl_order, unique(Level))))
# ---- A. Time-series trends by level (scaled) ----
p_line_teacher <- ggplot(teachers_long, aes(x = Year, y = Value_scaled, color = Type)) +
geom_line(linewidth = 1.1, na.rm = TRUE) +
geom_point(size = 1.7, alpha = 0.8) +
facet_wrap(~ Level, ncol = 3, scales = "fixed") +
theme_minimal(base_size = 12) +
theme(strip.background = element_rect(fill = "grey95", color = NA)) +
scale_color_manual(
values = c("Number of Teachers" = "#FF69B4",
"Compensation % of Total Expenditure" = "#3BBF8F"),
labels = c("Number of Teachers (per 10,000)", "Compensation (% of total)")
) +
labs(
title = "Teachers & Compensation Trends by Level (USA)",
y = "Scaled value (Teachers per 10,000; Compensation %)",
color = "Series"
) +
scale_x_continuous(breaks = pretty(unique(teachers_long$Year)))
p_line_teacher
# Interactive view (kept optional for HTML)
ggplotly(p_line_teacher)
# ---- B. Coverage diagnostics (counts by Type / Year / Level) ----
p_count_type <- ggplot(teachers_long, aes(x = Type)) +
geom_bar(fill = "#c562f0") +
theme_minimal(base_size = 12) +
labs(title = "Coverage: Count of Observations by Type", x = "Series", y = "Count")
p_count_type
p_count_year <- ggplot(teachers_long, aes(x = factor(Year))) +
geom_bar(fill = "#c562f0") +
theme_minimal(base_size = 12) +
labs(title = "Coverage: Observations per Year (all levels)", x = "Year", y = "Count")
p_count_year
p_count_level <- ggplot(teachers_long, aes(x = Level, fill = Type)) +
geom_bar(position = "dodge") +
theme_minimal(base_size = 12) +
labs(title = "Coverage: Observations by Level and Type", x = "Level", y = "Count", fill = "Series")
p_count_level
# ---- C. Correlation between staffing and compensation (where both exist) ----
# Wide table on the *scaled* values so axes are comparable
wide_tc <- teachers_long %>%
select(Level, Year, Type, Value_scaled) %>%
pivot_wider(names_from = Type, values_from = Value_scaled)
# overall (pooled) correlation with guard
corr_tbl <- tibble::tibble()
valid_overall <- wide_tc %>%
mutate(valid = complete.cases(`Number of Teachers`, `Compensation % of Total Expenditure`)) %>%
filter(valid)
if (nrow(valid_overall) >= 3) {
overall_r <- cor(valid_overall$`Number of Teachers`,
valid_overall$`Compensation % of Total Expenditure`,
use = "complete.obs", method = "pearson")
corr_tbl <- bind_rows(corr_tbl, tibble(Level = "All levels (pooled)", r = overall_r))
}
# by-level correlations (no cur_data(); compute within each group)
by_level_r <- wide_tc %>%
group_by(Level) %>%
summarise(
n_pairs = sum(complete.cases(`Number of Teachers`, `Compensation % of Total Expenditure`)),
r = ifelse(
n_pairs >= 3,
cor(`Number of Teachers`, `Compensation % of Total Expenditure`,
use = "complete.obs", method = "pearson"),
NA_real_
),
.groups = "drop"
) %>%
filter(!is.na(r))
corr_tbl <- bind_rows(corr_tbl, by_level_r %>% select(Level, r))
corr_tbl
## # A tibble: 6 × 2
## Level r
## <chr> <dbl>
## 1 All levels (pooled) -0.313
## 2 Pre-Primary -0.571
## 3 Primary 0.551
## 4 Lower Secondary -0.830
## 5 Upper Secondary -0.862
## 6 Secondary (unspecified) -0.334
# ---- D. Summary statistics (per Type × Level) ----
summary_stats_teachers <- teachers_long %>%
group_by(Type, Level) %>%
summarise(
n = n(),
mean_value = mean(Value, na.rm = TRUE),
median_value = median(Value, na.rm = TRUE),
sd_value = sd(Value, na.rm = TRUE),
min_value = min(Value, na.rm = TRUE),
max_value = max(Value, na.rm = TRUE),
IQR = IQR(Value, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(Type, Level)
write_csv(summary_stats_teachers, "tendency_teachers.csv")
summary_stats_teachers
## # A tibble: 12 × 9
## Type Level n mean_value median_value sd_value min_value max_value
## <chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Compensatio… Pre-… 6 49.0 49.0 6.83e-1 48.1 49.8
## 2 Compensatio… Prim… 6 49.0 49.0 6.83e-1 48.1 49.8
## 3 Compensatio… Lowe… 6 49.0 49.0 6.83e-1 48.1 49.8
## 4 Compensatio… Uppe… 6 49.0 49.0 6.83e-1 48.1 49.8
## 5 Compensatio… Seco… 6 46.8 46.7 9.30e-1 46.1 48.7
## 6 Compensatio… Tert… 6 28.5 28.3 7.90e-1 27.6 30.0
## 7 Number of T… Pre-… 12 531226. 608720. 1.40e+5 302255 651997.
## 8 Number of T… Prim… 17 1604827. 1687937 1.49e+5 1414000 1774348.
## 9 Number of T… Lowe… 14 810580. 858595. 8.81e+4 670172 894725.
## 10 Number of T… Uppe… 14 774670. 813548. 7.75e+4 650603 848440.
## 11 Number of T… Seco… 15 1501112. 1661375 2.95e+5 784414. 1737206.
## 12 Number of T… Tert… 22 853184. 747500 3.14e+5 574000 1581424
## # ℹ 1 more variable: IQR <dbl>
These observations are preliminary and will guide further analysis.
Reproducibility Notes
All intermediate datasets and figures are written to disk
(EdStats_USA.csv, EdStats_attend.csv,
EdStats_teacher.csv, and exported PNGs).
This allows others to review and re-run later steps without repeating
earlier processing.
Consider adding sessionInfo() at the end to document
package versions.
reference string
reference string
reference string